Comparing Word Relatedness Measures Based on Google $n$-grams
Abstract
Estimating word relatedness is essential in natural language processing (NLP) and in many related areas. Corpus-based word relatedness measures have advantages over knowledge-based and supervised measures. However, many corpus-based measures in the literature cannot be compared to one another because each uses a different corpus. The purpose of this paper is to show how different corpus-based measures of word relatedness can be evaluated by computing them over a common corpus (i.e., the Google n-grams) and then assessing their performance against gold-standard relatedness datasets. As a starting point, we evaluate six such measures, all re-implemented using the Google n-gram corpus as their only resource, by comparing their performance on five different datasets. We also show how a word relatedness measure based on a web search engine can be implemented using the Google n-gram corpus.
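The abstract does not spell out the measures' formulas, but one well-known search-engine-based measure that can be re-implemented over a fixed n-gram corpus is the Normalized Google Distance (NGD), which replaces live page-hit counts with corpus frequencies. A minimal sketch, assuming unigram and co-occurrence counts have already been extracted from the Google n-gram corpus (the counts in the example are made up for illustration):

```python
import math

def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance computed from raw frequencies.

    fx, fy -- occurrence counts of words x and y in the corpus
    fxy    -- co-occurrence count of x and y (e.g., within an n-gram window)
    n      -- total number of tokens (or documents) in the corpus
    """
    log_fx, log_fy = math.log(fx), math.log(fy)
    # Smaller NGD means the two words are more closely related.
    return (max(log_fx, log_fy) - math.log(fxy)) / \
           (math.log(n) - min(log_fx, log_fy))

# Hypothetical counts: x appears 1000 times, y 2000 times,
# they co-occur 500 times, in a corpus of 10^6 tokens.
print(round(ngd(1000, 2000, 500, 10**6), 3))  # -> 0.201
```

A relatedness score can then be derived as, e.g., 1 - NGD, and the same count-based plumbing supports PMI-style co-occurrence measures over the same n-gram tables.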
Similar Papers
TrWP: Text Relatedness using Word and Phrase Relatedness
Text is composed of words and phrases. In the bag-of-words model, phrases in texts are split into words, which may discard the inner semantics of the phrases and in turn yield inconsistent relatedness scores between two texts. TrWP, an unsupervised text relatedness approach, combines both word and phrase relatedness. The word relatedness is computed using an existing unsupervised co-occurrence based...
A Computationally Efficient Measure for Word Semantic Relatedness Using Time Series
Measurement of word semantic relatedness plays an important role in a wide range of natural language processing and information retrieval applications, such as full-text search, summarization, classification, and clustering. In this paper, we propose an easy-to-implement and low-cost method for estimating word semantic relatedness. The proposed method is based on the utilization of words tempo...
WikiRelate! Computing Semantic Relatedness Using Wikipedia
Wikipedia provides a knowledge base for computing word relatedness in a more structured fashion than a search engine and with more coverage than WordNet. In this work we present experiments on using Wikipedia for computing semantic relatedness and compare it to WordNet on various benchmarking datasets. Existing relatedness measures perform better using Wikipedia than a baseline given by Google ...
Distributed Distributional Similarities of Google Books over the Centuries
This paper introduces a distributional thesaurus and sense clusters computed on the complete Google Syntactic N-grams, which is extracted from Google Books, a very large corpus of digitized books published between 1520 and 2008. We show that a thesaurus computed on such a large text basis leads to much better results than using smaller corpora like Wikipedia. We also provide distributional thes...
Inferring Selectional Preferences from Part-Of-Speech N-grams
We present the PONG method to compute selectional preferences using part-of-speech (POS) N-grams. From a corpus labeled with grammatical dependencies, PONG learns the distribution of word relations for each POS N-gram. From the much larger but unlabeled Google N-grams corpus, PONG learns the distribution of POS N-grams for a given pair of words. We derive the probability that one word has a giv...